Abstract:

You've inherited a live code base (a web site, in fact) and it keeps crashing. The pager goes off incessantly. If you're really unlucky, this is a US/India company, and peak failure times fall during the US daytime. Engineering wants a rewrite to fix it, but that will take 6 months, optimistically! What the hell do you do to retain your sanity?

I will talk about my experience with a couple of such project takeovers when I was an Engineering Manager at Yahoo!. I will cover the following topics:

  • A quick overview of the problems we faced
  • Staunch the bleeding! Ensure better monitoring and uptime.
  • Triage: identify the dead, the critically wounded, and the minor injuries in the system, and decide what to fix first.
  • Fight the rewrite!
  • What to monitor, and how
  • Lessons learned, and what to do for new projects

Proposer:

  • Vijay Ramachandran - Wisdom Camp